Abstract

We consider the problem of adaptive estimation of the regression function 
in a framework where we replace ergodicity assumptions (such as independence 
or mixing) by another structural assumption on the model. Namely, we pro- 
pose adaptive upper bounds for kernel estimators with data-driven bandwidth 
(Lepski's selection rule) in a regression model where the noise is an increment of 
martingale. It includes, as very particular cases, the usual i.i.d. regression and 
auto-regressive models. The cornerstone tool for this study is a new result for 
self-normalized martingales, called "stability", which is of independent interest. 
In a first part, we only use the martingale increment structure of the noise. We 
give an adaptive upper bound using a random rate, that involves the occupation 
time near the estimation point. Thanks to this approach, the theoretical study 
of the statistical procedure is disconnected from usual ergodicity properties like 
mixing. Then, in a second part, we make a link with the usual minimax theory 
of deterministic rates. Under a $\beta$-mixing assumption on the covariates process,
we prove that the random rate considered in the first part is equivalent, with 
large probability, to a deterministic rate which is the usual minimax adaptive

1 Introduction

1.1 Motivations 

In the theoretical study of statistical or learning algorithms, stationarity, ergodicity 
and concentration inequalities are assumptions and tools of first importance. When 
one wants to obtain asymptotic results for some procedure, stationarity and ergodicity
of the random process generating the data are mandatory. Using extra assumptions,
like moments and boundedness conditions, concentration inequalities can be used to
obtain finite sample results. Such tools are standard when the random process is
assumed to be i.i.d., like Bernstein's or Talagrand's inequality (see [20], [31] and [28],
among others). To go beyond independence, one can use a mixing assumption in order 
to "get back" independence using coupling, see [9], so that, roughly, the "indepen- 
dent data tools" can be used again. This approach is widely used in nonparametric 
statistics, statistical learning theory and time series analysis. 

The aim of this paper is to replace stationarity and ergodicity assumptions (such 
as independence or mixing) by another structural assumption on the model. Namely, 
we consider a regression model where the noise is an increment of martingale. It 
includes, as very particular cases, the usual i.i.d. regression and the auto-regressive 
models. The cornerstone tool for this study is a new result, called "stability", for 
self-normalized martingales, which is of independent interest. In this framework, we 
study kernel estimators with a data-driven bandwidth, following Lepski's selection
rule, see [22], [24].

Lepski's method is a statistical algorithm for the construction of optimal
adaptive estimators. It was introduced in [21, 22, 23], and it provides a way to 
select the bandwidth of a kernel estimator from the data. It shares the same kind 
of adaptation properties to the inhomogeneous smoothness of a signal as wavelet 
thresholding rules, see [24]. It can be used to construct an adaptive estimator of a 
multivariate anisotropic signal, see [18], and recent developments show that it can
be used in more complex settings, like adaptation to the semi-parametric structure of 
the signal for dimension reduction, or the estimation of composite functions, see [13], 
[17]. In summary, it is commonly accepted that Lepski's idea for the selection of a
smoothing parameter works for many problems. However, theoretical results for this
procedure are mostly stated in the idealized model of Gaussian white noise, except
for [12], where the model of regression with a random design was considered. As far
as we know, nothing is known about this procedure in other settings: think for instance
of the auto-regressive model or models with dependent data.

Our approach is in two parts: in a first part, we consider the problem of estimation 
of the regression function. We give an adaptive upper bound using a random rate, 
that involves the occupation time at the estimation point, see Theorem 1. In this first 
part, we only use the martingale increment structure of the noise, and not stationarity 
or ergodicity assumptions on the observations. Consequently, even if the underlying 
random process is transient (e.g. there are few observations at the estimation point), 
the result holds, but the occupation time is typically small, so that the random rate 
is large (and possibly does not go to zero as the sample size increases). The key tool
is a new stability result for self-normalized martingales stated in Theorem 2, see
Section 3. It works surprisingly well for the statistical application proposed here, but
it might give new results for other problems as well, like upper bounds for procedures
based on minimization of the empirical risk, model selection (see [26]), etc. In a second
part (Section 4), we make a link with the usual minimax theory of deterministic rates.
Using a $\beta$-mixing assumption, we prove that the random rate used in Section 2 is
equivalent, with a large probability, to a deterministic rate which is the usual adaptive 
minimax one, see Proposition 1. 

The message of this paper is twofold. First, we show that the kernel estimator and 
Lepski's method are very robust with respect to the statistical properties of the model:
they do not require stationarity or ergodicity assumptions, such as independence or
mixing, to "do the job of adaptation", see Theorem 1. The second part of the message
is that, for the theoretical assessment of an estimator, one can advantageously use a
theory involving random rates of convergence. Such a random rate naturally depends 
on the occupation time at the point of estimation (=the local amount of data), and 
it is "almost observable" if the smoothness of the regression were to be known. An 
ergodicity property, such as mixing, shall only be used in a second step of the theory,
for the derivation of the asymptotic behaviour of this rate (see Section 4). Of course, 
the idea of random rates for the assessment of an estimator is not new. It has already 
been considered in [15, 14] for discrete time and in [8] for diffusion models. However, 
this work contains, as far as we know, the first result concerning adaptive estimation 
of the regression with a martingale increment noise. 



1.2 The model 

Consider sequences $(X_k)_{k \ge 0}$ and $(Y_k)_{k \ge 1}$ of random variables respectively in $\mathbb{R}^d$ and
$\mathbb{R}$, both adapted to a filtration $(\mathcal{F}_k)_{k \ge 0}$, and such that for all $k \ge 1$:

$$Y_k = f(X_{k-1}) + \varepsilon_k, \tag{1}$$

where the sequence $(\varepsilon_k)_{k \ge 1}$ is an $(\mathcal{F}_k)$-martingale increment:

$$\mathbb{E}(|\varepsilon_k| \mid \mathcal{F}_{k-1}) < \infty \quad \text{and} \quad \mathbb{E}(\varepsilon_k \mid \mathcal{F}_{k-1}) = 0,$$

and where $f : \mathbb{R}^d \to \mathbb{R}$ is the unknown function of interest. We study the problem
of estimation of $f$ at a point $x \in \mathbb{R}^d$ based on the observation of $(Y_1, \dots, Y_N)$ and
$(X_0, \dots, X_{N-1})$, where $N \ge 1$ is a finite $(\mathcal{F}_k)$-stopping time. This allows for "sample
size designing", see Remark 1 below. The analysis is conducted under the following
assumption on the sequence $(\varepsilon_k)_{k \ge 1}$.

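Assumption 1. There is an $(\mathcal{F}_k)$-adapted sequence $(\sigma_k)_{k \ge 0}$, assumed to be observed,
of positive random variables and $\mu, \gamma > 0$ such that:

$$\mathbb{E}\Big[\exp\Big(\mu \frac{\varepsilon_k^2}{\sigma_{k-1}^2}\Big) \,\Big|\, \mathcal{F}_{k-1}\Big] \le \gamma \quad \forall k \ge 1.$$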



This assumption means that the martingale increment $\varepsilon_k$, normalized by $\sigma_{k-1}$,
is uniformly subgaussian. In the case where $\varepsilon_k$ is Gaussian conditionally to $\mathcal{F}_{k-1}$,
Assumption 1 is satisfied if $(\sigma_k)$ is such that $\mathrm{Var}(\varepsilon_k \mid \mathcal{F}_{k-1}) \le c\,\sigma_{k-1}^2$ for any $k > 0$,
where $c > 0$ is a deterministic constant not depending on $k$. If one assumes that
$\mathrm{Var}(\varepsilon_k \mid \mathcal{F}_{k-1}) \le \sigma^2$ for a known constant $\sigma > 0$, one can simply take $\sigma_k \equiv \sigma$. Note
that $\sigma_{k-1}^2$ is not necessarily the conditional variance of $\varepsilon_k$, but an observed upper
bound of it.

Particular cases of model (1) are the regression and the auto-regressive model.

Example 1. In the regression model, one observes $(Y_k, X_{k-1})_{k=1}^N$ satisfying

$$Y_k = f(X_{k-1}) + s(X_{k-1})\zeta_k,$$

where $(\zeta_k)$ is i.i.d. centered, such that $\mathbb{E}(\exp(\mu\zeta_k^2)) \le \gamma$ and independent of $\mathcal{F}_k =
\sigma(X_0, \dots, X_k)$, and where $f : \mathbb{R}^d \to \mathbb{R}$ and $s : \mathbb{R}^d \to \mathbb{R}_+$. This model is a particular
case of (1) with $\sigma_k^2 \ge s(X_k)^2$.

Example 2. In the auto-regressive model, one observes a sequence $(X_k)_{k=0}^N$ in $\mathbb{R}^d$
satisfying

$$X_k = f(X_{k-1}) + S(X_{k-1})\zeta_k, \tag{2}$$

where $f = (f_1, \dots, f_d) : \mathbb{R}^d \to \mathbb{R}^d$, where $S : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ and where $\zeta_k = (\zeta_{k,1}, \dots, \zeta_{k,d})$
is a sequence of centered i.i.d. vectors in $\mathbb{R}^d$ independent of $X_0$, with covariance
matrix $I_d$ and such that $\mathbb{E}(\exp(\mu\zeta_{k,j}^2)) \le \gamma$. The problem of estimation of each
coordinate $f_j$ is a particular case of (1) with $Y_k = (X_k)_j$, $\mathcal{F}_k = \sigma(X_0, \zeta_1, \dots, \zeta_k)$ and
$\sigma_k^2 \ge S_{j,j}(X_k)^2$.

Let us mention that these two examples are very particular. The analysis conducted
here allows one to go way beyond the i.i.d. case, as long as $(\varepsilon_k)$ is a martingale
increment.
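
To fix ideas, the following Python sketch simulates a scalar instance of the auto-regressive model of Example 2. It is only an illustration under our own assumptions: the helper name `simulate_ar`, the particular choices of $f$ and $s$, the dimension $d = 1$ and the standard Gaussian innovations (which satisfy the exponential moment condition for any $\mu < 1/2$) are not prescribed by the text.

```python
import math
import random


def simulate_ar(n, f, s, x0=0.0, seed=0):
    """Simulate X_k = f(X_{k-1}) + s(X_{k-1}) * zeta_k, k = 1..n (scalar case).

    Assumption of this sketch: the zeta_k are i.i.d. standard Gaussian.
    Returns (xs, ys, sigmas) where
      xs     = [X_0, ..., X_{n-1}]   (covariates),
      ys     = [Y_1, ..., Y_n]       with Y_k = X_k,
      sigmas = [s(X_0), ..., s(X_{n-1})], an observed scale for each noise term.
    """
    rng = random.Random(seed)
    xs, ys, sigmas = [], [], []
    x = x0
    for _ in range(n):
        scale = s(x)
        y = f(x) + scale * rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
        sigmas.append(scale)
        x = y  # the response at step k is the covariate at step k + 1
    return xs, ys, sigmas


# Hypothetical example: a bounded drift and a constant noise level.
xs, ys, sigmas = simulate_ar(1000, f=lambda x: 0.5 * math.sin(x), s=lambda x: 1.0)
```

The triple returned here has exactly the shape of the observations $(X_{k-1}, Y_k)$ together with the observed scale sequence used in the model.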

Remark 1. The results given in Section 2 are stated in a setting where one observes
$(X_{k-1}, Y_k)_{k=1}^N$ with $N$ a stopping time. Of course, this contains the usual case $N \equiv n$,
where $n$ is a fixed sample size. This framework includes situations where the
statistician decides to stop the sampling according to some design of experiment rule.
This is the case when obtaining data has a cost that cannot exceed a maximum
value, for instance.
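
As an illustration of such a design rule, the sketch below stops sampling as soon as a target number of covariates has fallen within distance $h$ of $x$, or when a budget `max_n` is exhausted. The function name `sample_until_enough`, the callback `draw_next` and the numerical values are hypothetical choices of ours. The decision to stop after $k$ draws only uses $X_0, \dots, X_{k-1}$, so the resulting sample size is a stopping time.

```python
import random


def sample_until_enough(draw_next, x, h, target, max_n, seed=0):
    """Draw covariates until `target` of them lie in [x - h, x + h] or `max_n` is reached.

    `draw_next(rng, past)` returns the next covariate given the past ones.
    Returns (xs, hits); the sample size N is len(xs).
    """
    rng = random.Random(seed)
    xs = []
    hits = 0
    while len(xs) < max_n and hits < target:
        xk = draw_next(rng, xs)
        xs.append(xk)
        if abs(xk - x) <= h:
            hits += 1
    return xs, hits


# Hypothetical design: i.i.d. uniform covariates on [-1, 1], estimation point x = 0.
xs, hits = sample_until_enough(
    lambda rng, past: rng.uniform(-1.0, 1.0), x=0.0, h=0.1, target=20, max_n=10_000
)
N = len(xs)
```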

Remark 2. Note that while $\zeta_k = \varepsilon_k/\sigma_{k-1}$ is conditionally subgaussian, $\varepsilon_k$ is not in
general (see [6] for examples).



1.3 The Lepski's method 

In what follows, $|x|$ stands for the Euclidean norm of $x \in \mathbb{R}^d$. An object of importance
in the analysis conducted below is the following. For $h > 0$, we define

$$L(h) = \sum_{k=1}^N \frac{1}{\sigma_{k-1}^2} \mathbf{1}_{|X_{k-1} - x| \le h},$$

which is the occupation time of $(X_k)_{k \ge 0}$ at $x$ renormalized by $(\sigma_k)$. Then, if $h$ is
such that $L(h) > 0$ (there is at least one observation in the interval $[x - h, x + h]$),
we define the kernel estimator

$$\hat f(h) = \frac{1}{L(h)} \sum_{k=1}^N \frac{1}{\sigma_{k-1}^2} \mathbf{1}_{|X_{k-1} - x| \le h}\, Y_k.$$

Let $(h_i)_{i \ge 0}$ be a decreasing sequence of positive numbers, called bandwidths, and
define the following set, called grid, as

$$\mathcal{H} := \{h_j : L(h_j) > 0\}.$$

For the sake of simplicity, we will consider only a geometrical grid, where

$$h_j = h_0 q^j$$

for some parameters $h_0 > 0$ and $q \in (0, 1)$. Lepski's method selects one of the
bandwidths in $\mathcal{H}$. Let $b > 0$ and for any $h > 0$, define

$$\psi(h) := 1 + b \log(h_0/h).$$

For $u > 0$, define, on the event $\{L(h_0)^{-1/2} \le u\}$, the bandwidth

$$H_u = \min\{h \in \mathcal{H} : (\psi(h)/L(h))^{1/2} \le u\}, \tag{3}$$

and let $u_0 > 0$. The estimator of $f(x)$ is $\hat f(\hat H)$ defined on the set $\{L(h_0)^{-1/2} \le u_0\}$,
where $\hat H$ is selected according to the following rule:

$$\hat H := \max\Big\{h \in \mathcal{H} : h \ge H_{u_0} \text{ and } \forall h' \in [H_{u_0}, h] \cap \mathcal{H},\; |\hat f(h) - \hat f(h')| \le \nu\,(\psi(h')/L(h'))^{1/2}\Big\}, \tag{4}$$

where $\nu$ is a positive constant. This is the standard Lepski's procedure, see [22, 23,
24, 25]. In the next Section, we give an upper bound for $\hat f(\hat H)$, with a normalization
(convergence rate) that involves $L(h)$. This result is stated without any further
assumptions on the model.
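
The selection rule can be sketched in a few lines of Python. This is a minimal illustration for scalar covariates ($d = 1$), not a reference implementation: the function names, the default values of the grid and threshold parameters and the finite grid length `n_grid` are our own assumptions, and the rule coded here is our reading of the occupation time, the rectangular-kernel estimator, the geometric grid and rule (4).

```python
import math


def occupation(xs, sigmas, x, h):
    """Occupation time L(h): sum of sigma_{k-1}^{-2} over covariates within h of x."""
    return sum(1.0 / sg ** 2 for xk, sg in zip(xs, sigmas) if abs(xk - x) <= h)


def kernel_estimate(xs, ys, sigmas, x, h):
    """Rectangular-kernel estimator at bandwidth h; requires L(h) > 0."""
    num = sum(yk / sg ** 2 for xk, yk, sg in zip(xs, ys, sigmas) if abs(xk - x) <= h)
    return num / occupation(xs, sigmas, x, h)


def lepski(xs, ys, sigmas, x, h0=1.0, q=0.8, b=1.0, u0=1.0, nu=2.0, n_grid=40):
    """Return (selected bandwidth, estimate), or None if L(h0)^(-1/2) > u0."""
    bandwidths = [h0 * q ** j for j in range(n_grid)]  # geometric grid, decreasing
    occ = {h: occupation(xs, sigmas, x, h) for h in bandwidths}
    if occ[h0] <= 0 or occ[h0] ** -0.5 > u0:
        return None
    grid = [h for h in bandwidths if occ[h] > 0]

    def threshold(h):
        # (psi(h) / L(h))^(1/2) with psi(h) = 1 + b * log(h0 / h)
        return math.sqrt((1.0 + b * math.log(h0 / h)) / occ[h])

    h_u0 = min(h for h in grid if threshold(h) <= u0)  # smallest admissible bandwidth
    candidates = [h for h in grid if h >= h_u0]
    est = {h: kernel_estimate(xs, ys, sigmas, x, h) for h in candidates}
    # Keep h if its estimate agrees with every smaller candidate up to the threshold.
    accepted = [
        h for h in candidates
        if all(abs(est[h] - est[hp]) <= nu * threshold(hp) for hp in candidates if hp <= h)
    ]
    h_hat = max(accepted)  # never empty: h_u0 is always accepted
    return h_hat, est[h_hat]
```

On data where the regression function is flat near $x$, all estimates agree and the largest bandwidth is selected; when the function varies, larger bandwidths are rejected as soon as their estimate departs from those computed on smaller windows.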

Remark 3. The number $u_0$ is a fixed constant such that the largest bandwidth $h_0$ in
the grid satisfies $L(h_0)^{-1/2} \le u_0$. This deterministic constraint is very mild: if we
have some data close to $x$, and if $h_0$ is large enough (this is the largest bandwidth in
the grid), then $L(h_0)$ should be large, at least such that $L(h_0)^{-1/2} \le u_0$. Consider
the following basic example: $X_k \in [-1, 1]^d$ almost surely for any $k$ and $\sigma_k \equiv 1$, then
by taking $h_0 = \sqrt{d}$ and $u_0 = 1$ the event $\{L(h_0)^{-1/2} \le u_0\}$ has probability one. In
Section 4 (see Proposition 1) we prove that a mixing assumption on $(X_k)_{k \ge 0}$ entails
that this event has an overwhelming probability.



2 Adaptive upper bound 

The usual way of stating an adaptive upper bound for $\hat f(\hat H)$, see for instance [24], is
to prove that it has the same convergence rate as the oracle estimator $\hat f(H^*)$, which
is the "best" among a collection $\{\hat f(h) : h \in \mathcal{H}\}$. The oracle bandwidth $H^*$ realizes
a bias-variance trade-off, that involves explicitly the unknown $f$. For $h \in \mathcal{H}$ define

$$\tilde f(h) := \frac{1}{L(h)} \sum_{k=1}^N \frac{1}{\sigma_{k-1}^2} \mathbf{1}_{|X_{k-1} - x| \le h}\, f(X_{k-1}). \tag{5}$$

Consider a family of non-negative random variables $(W(h); h \in \mathcal{H})$ that bounds from
above the local smoothness of $f$ (measured by its increments):

$$\sup_{h' \in [H_{u_0}, h] \cap \mathcal{H}} |\tilde f(h') - f(x)| \le W(h), \quad \forall h \in \mathcal{H}. \tag{6}$$

Nothing is required on $(W(h) : h \in \mathcal{H})$ for the moment, one can perfectly choose it
as the left hand side of (6) for each $h \in \mathcal{H}$ for instance. However, for the analysis
conducted here, we need to bound $W$ from below and above (see Remark 5): introduce

$$\bar W(h) := [W(h) \vee (\delta_0 (h/h_0)^{\alpha_0})] \wedge u_0, \tag{7}$$

where $\delta_0$ and $\alpha_0$ are positive constants. On the set

$$\{L(h_0)^{-1/2} \le \bar W(h_0)\},$$

define the random oracle bandwidth

$$H^* := \min\{h \in \mathcal{H} : (\psi(h)/L(h))^{1/2} \le \bar W(h)\}, \tag{8}$$

and consider the event

$$\Omega' := \{L(h_0)^{-1/2} \le \bar W(h_0),\; W(H^*) \le u_0\}.$$

The event $\Omega'$ is the "minimal" requirement for the proof of an upper bound for $\hat f(\hat H)$,
see Remarks 5 and 6 below.
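
For readers who prefer code, here is a small Python sketch of the clipped smoothness bound (7) and of the oracle bandwidth (8), computed on a given grid from given occupation times. The function names `w_bar` and `oracle_bandwidth`, the dictionary `L` of occupation times and the numerical values in the example are hypothetical; in practice $W$ depends on the unknown $f$, so this quantity is an oracle and not an estimator.

```python
import math


def w_bar(W, h, h0, delta0, alpha0, u0):
    """Clipped bound of (7): [W(h) v delta0 * (h / h0)^alpha0] ^ u0."""
    return min(max(W(h), delta0 * (h / h0) ** alpha0), u0)


def oracle_bandwidth(grid, L, W, h0, b, delta0, alpha0, u0):
    """Oracle bandwidth of (8): smallest grid point h with
    sqrt(psi(h) / L(h)) <= w_bar(h), where psi(h) = 1 + b * log(h0 / h).

    `grid` lists bandwidths with L[h] > 0; returns None if no bandwidth qualifies.
    """
    ok = [
        h for h in grid
        if math.sqrt((1.0 + b * math.log(h0 / h)) / L[h])
        <= w_bar(W, h, h0, delta0, alpha0, u0)
    ]
    return min(ok) if ok else None


# Hypothetical example: W(h) = h (Lipschitz-type bound), three bandwidths.
grid = [1.0, 0.5, 0.25]
L = {1.0: 100.0, 0.5: 50.0, 0.25: 25.0}
h_star = oracle_bandwidth(grid, L, W=lambda h: h, h0=1.0, b=1.0,
                          delta0=0.01, alpha0=2.0, u0=1.0)
```

In this toy example the stochastic term at $h = 0.25$ exceeds the smoothness bound, so the balance is reached at $h = 0.5$.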



Theorem 1. Let Assumption 1 hold and let $\hat f(\hat H)$ be the procedure given by
Lepski's rule (4). Then, for any $\rho \in (0, b\mu\nu^2/(64\alpha_0(1 + \gamma)))$, we have

$$\mathbb{P}\big[\{|\hat f(\hat H) - f(x)| \ge t\,\bar W(H^*)\} \cap \Omega'\big] \le C_0\,[\dots]$$

for any $t \ge t_0$, where $C_0, t_0 > 0$ are constants depending on $\rho, \mu, \gamma, q, b, u_0, \delta_0, \alpha_0, \nu$.

The striking fact in this Theorem is that we do not use any stationarity, ergodicity
or concentration property. In particular, we cannot give at this point the behaviour of
the random normalization $\bar W(H^*)$. It does not go to $0$ in probability as $N \to +\infty$
when $L(h_0)$ does not go to $+\infty$ in probability, which happens if $(X_k)_{k \ge 0}$ is a transient
Markov chain for instance. Hence, without any further assumption, Theorem 1 does
not entail that $\hat f(\hat H)$ is close to $f(x)$. On the other hand, when $(X_k)_{k \ge 0}$ is mixing, we
prove that $\bar W(H^*)$ behaves as the deterministic minimax optimal rate, see Section 4.
The cornerstone of the proof of this Theorem is a new result concerning the stability
of self-normalized martingales, see Theorem 2 in Section 3 below.

Remark 4. The parameter $\rho$ of decay of the probability in Theorem 1 is increasing
with the threshold parameter $\nu$ from (4). So, for any $p > 0$ and $\nu$ large enough,
Theorem 1 entails that the expectation of $(\bar W(H^*)^{-1}|\hat f(\hat H) - f(x)|)^p \mathbf{1}_{\Omega'}$ is finite.

Remark 5. The definition of $\bar W$ is related to the fact that since nothing is required
on the sequence $(X_k)$, the occupation time $L(h)$ can be small, even if $h$ is large. In
particular, $L(h)$ has no reason to be close to its expectation. So, without the
introduction of $\bar W$ above, which bounds $W$ from below by a power function, we cannot give
a lower estimate of $H^*$ (even rough), which is mandatory for the proof of Theorem 1.

Remark 6. On the event $\Omega'$, we have $\{L(h_0)^{-1/2} \le \bar W(h_0)\}$, meaning that the
bandwidth $h_0$ (the largest in $\mathcal{H}$) is large enough to contain enough points in $[x - h_0, x + h_0]$,
so that $L(h_0) \ge \bar W(h_0)^{-2}$. This is not a restriction when $W(h) = Lh^s$ [$f$ has a local
Hölder exponent $s$] for instance, see Section 4.

Remark 7. In the definition of $\hat f(\hat H)$, we use kernel estimation with the rectangular
kernel $K(x) = \mathbf{1}_{[-1,1]}(x)/2$. This is mainly for technical simplicity, since the proof of
Theorem 1 is already technically involved. Consequently, Theorem 1 does not give,
on particular cases (see Section 4), the adaptive minimax rate of convergence for
regression functions with a Hölder exponent $s$ larger than 1. To improve this, one
can consider Lepski's method applied to local polynomials (LP) (see [12], and
see [10] about (LP)). This would lead, in the framework considered here, to strong
technical difficulties.






3 Stability for self-normalized martingales 



We consider a local martingale $(M_n)_{n \in \mathbb{N}}$ with respect to a filtration $(\mathcal{G}_n)_{n \in \mathbb{N}}$, and
for $n \ge 1$ denote its increment by $\Delta M_n := M_n - M_{n-1}$. The predictable quadratic
variation of $M_n$ is

$$\langle M \rangle_n := \sum_{k=1}^n \mathbb{E}[\Delta M_k^2 \mid \mathcal{G}_{k-1}].$$

Concentration inequalities for martingales have a long history. The first ones are
Azuma-Hoeffding's inequality (see [1], [16]) and Freedman's inequality (see [11]).
The latter states that, if $(M_n)$ is a square integrable martingale such that $|\Delta M_k| \le c$
a.s. for some constant $c > 0$ and $M_0 = 0$, then for any $x, y > 0$:

$$\mathbb{P}[M_n \ge x, \langle M \rangle_n \le y] \le \exp\Big(-\frac{x^2}{2(y + cx)}\Big). \tag{9}$$

Later on, an alternative to the assumption $|\Delta M_k| \le c$ was proposed. This is the
so-called Bernstein's condition, which requires that there is some constant $c > 0$ such
that for any $p \ge 2$:

$$\sum_{k=1}^n \mathbb{E}[|\Delta M_k|^p \mid \mathcal{G}_{k-1}] \le \frac{p!}{2} c^{p-2} \langle M \rangle_n, \tag{10}$$

see [7], and [27]. In [30] (see Chapter 8), inequality (9) is proved with $\langle M \rangle_n$ replaced
by a $\mathcal{G}_{n-1}$-measurable random variable $nR_n^2$, under the assumption that

$$\sum_{k=1}^n \mathbb{E}[|\Delta M_k|^p \mid \mathcal{G}_{k-1}] \le \frac{p!}{2} c^{p-2} n R_n^2 \tag{11}$$

holds for any $p \ge 2$. There are many other very recent deviation inequalities
for martingales, in particular inequalities involving the quadratic variation $[M]_n =
\sum_{k=1}^n \Delta M_k^2$, see for instance [7] and [4].

For the proof of Theorem 1, a Bernstein's type of inequality is not enough: note
that in (9), it is mandatory to work on the event $\{\langle M \rangle_n \le y\}$. A control of the
probability of this event usually requires an extra assumption on $(X_k)_{k \ge 0}$, such as
independence or mixing (see Section 4), and this is precisely what we wanted to avoid
here. Moreover, for the proof of Theorem 1, we need a result concerning $M_T$, where
$T$ is an arbitrary finite stopping-time.

In order to tackle this problem, a first idea is to try to give a deviation for the
self-normalized martingale $M_T/\sqrt{\langle M \rangle_T}$. It is well-known that this is not possible, a very
simple example is given in Remark 8 below. In the next Theorem 2, we give a simple
solution to this problem. Instead of $M_T/\sqrt{\langle M \rangle_T}$, we consider $\sqrt{a} M_T/(a + \langle M \rangle_T)$,
where $a > 0$ is an arbitrary real number, and we prove that the exponential moments
of this random variable are uniformly bounded under Assumption 2 below. The result
stated in Theorem 2 is of independent interest, and we believe that it can be useful
for other statistical problems.

Assumption 2. Assume that $M_0 = 0$ and that

$$\Delta M_n = s_{n-1}\zeta_n, \tag{12}$$

for any $n \ge 1$, where $(s_n)_{n \in \mathbb{N}}$ is a $(\mathcal{G}_n)$-adapted sequence of random variables and
$(\zeta_n)_{n \ge 1}$ is a sequence of $(\mathcal{G}_n)$-martingale increments such that for $\alpha = 1$ or $\alpha = 2$
and some $\mu > 0$, $\gamma > 1$:

$$\mathbb{E}[\exp(\mu|\zeta_k|^\alpha) \mid \mathcal{G}_{k-1}] \le \gamma \quad \text{for any } k \ge 1. \tag{13}$$

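Let us define

$$V_n := \sum_{k=1}^n s_{k-1}^2.$$

Note that if $(\zeta_n)_{n \ge 1}$ is a conditionally normalized sequence (ie $\mathbb{E}(\zeta_n^2 \mid \mathcal{G}_{n-1}) = 1$)
then (12) entails that $V_n = \langle M \rangle_n$. Moreover, if Assumption 2 holds, we have $\langle M \rangle_n \le
c_\mu V_n$ for any $n \ge 1$ with $c_\mu = \ln 2/\mu$ when $\alpha = 2$ and $c_\mu = 2/\mu^2$ when $\alpha = 1$. Denote
$\cosh(x) = (e^x + e^{-x})/2$ for any $x \in \mathbb{R}$.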

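Theorem 2. Let Assumption 2 hold.

• If $\alpha = 2$, we have for any $\lambda \in [0, \frac{\mu}{2(1 + \gamma)})$, any $a > 0$ and any finite stopping-time
$T$:

$$\mathbb{E}\Big[\exp\Big(\lambda \frac{a M_T^2}{(a + V_T)^2}\Big)\Big] \le 1 + c_\lambda, \tag{14}$$

where $c_\lambda := \exp\big(\frac{\lambda\Gamma_\lambda}{2(1 - 2\lambda\Gamma_\lambda)}\big)(\exp(\lambda\Gamma_\lambda) - 1)$ and $\Gamma_\lambda := \frac{1 + 2\gamma}{2(\mu - \lambda)}$.

• If $\alpha = 1$, we have for any $\lambda \in (-\mu, \mu)$, any $a > 0$ and any finite stopping-time $T$:

$$\mathbb{E}\Big[\cosh\Big(\lambda \frac{\sqrt{a} M_T}{a + V_T}\Big)\Big] \le 1 + c'_\lambda, \tag{15}$$

where $c'_\lambda = (\gamma - 1)\lambda^2 \exp((\gamma - 1)\lambda^2/\mu^2)\cosh(2\log 2 + 2(\gamma - 1)\lambda^2/\mu^2)/\mu^2$.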

The proof of Theorem 2 is given in Section 5. Theorem 2 shows that when $\zeta_k$ is
subgaussian (resp. sub-exponential) conditionally to $\mathcal{G}_{k-1}$, then $\sqrt{a}|M_T|/(a + V_T)$ is
also subgaussian (resp. sub-exponential), hence the name stability. Indeed, we cannot
expect an improvement in the tails of $\sqrt{a}|M_T|/(a + V_T)$ due to the summation, since
the $s_{k-1}$ are arbitrary (for instance, they can be equal to zero for every $k$ except
one).
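
A quick Monte Carlo experiment gives a feeling for the subgaussian case of this stability result. The sketch below is purely illustrative and rests on our own assumptions: standard Gaussian increments (for which the exponential moment of the square at level $\mu = 1/4$ equals $\sqrt{2}$), a hypothetical predictable scale equal to one plus the absolute value of the previous martingale value, and a deterministic horizon in place of a general stopping time. The function names are ours.

```python
import math
import random


def c_lambda(lam, mu, gamma):
    """Constant c_lambda of the subgaussian case, with
    Gamma = (1 + 2 * gamma) / (2 * (mu - lam))."""
    assert 0.0 <= lam < mu / (2.0 * (1.0 + gamma))
    Gam = (1.0 + 2.0 * gamma) / (2.0 * (mu - lam))
    return math.exp(lam * Gam / (2.0 * (1.0 - 2.0 * lam * Gam))) * (math.exp(lam * Gam) - 1.0)


def mc_exp_moment(lam, a, n_steps=30, n_paths=2000, seed=0):
    """Monte Carlo estimate of E exp(lam * a * M_T^2 / (a + V_T)^2) at T = n_steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        M = 0.0
        V = 0.0
        for _ in range(n_steps):
            s = 1.0 + abs(M)           # scale depends on the past only (predictable)
            zeta = rng.gauss(0.0, 1.0)  # centered increment, independent of the past
            M += s * zeta
            V += s * s
        total += math.exp(lam * a * M * M / (a + V) ** 2)
    return total / n_paths


mu, gamma, lam = 0.25, math.sqrt(2.0), 0.04
estimate = mc_exp_moment(lam, a=1.0)
bound = 1.0 + c_lambda(lam, mu, gamma)
```

In this experiment the empirical exponential moment stays well below the theoretical bound, although the scales grow with the path; the bound is not meant to be sharp.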

Remark 8. It is tempting to take "$a = V_T$" in Theorem 2. However, the following
basic example shows that it is not possible. Take $(B_t)_{t \ge 0}$ a standard Brownian
motion, consider $M_n = B_n$ and define the stopping time $T_c = \inf\{n \ge 1 : B_n/\sqrt{n} \ge
c\}$, where $c > 0$. For any $c > 0$, $T_c$ is finite a.s. (use the law of the iterated logarithm for
instance). So, in this example, one has $M_{T_c}/\sqrt{\langle M \rangle_{T_c}} = M_{T_c}/\sqrt{T_c} \ge c$, for any $c > 0$.



4 Consistency with the minimax theory of deterministic rates

In this Section, we prove that, when $(X_k)_{k \ge 0}$ is mixing, then Theorem 1 gives the
adaptive minimax upper bound. Let us consider again sequences $(X_k)_{k \ge 0}$ and $(Y_k)_{k \ge 1}$
of random variables satisfying (1), where $(\varepsilon_k)_{k \ge 1}$ is an $(\mathcal{F}_k)_{k \ge 0}$-martingale increment.
For the sake of simplicity, we work under the following simplified version of
Assumption 1.

Assumption 3. There is a known $\sigma > 0$ and $\mu, \gamma > 0$ such that:

$$\mathbb{E}\Big[\exp\Big(\mu\frac{\varepsilon_k^2}{\sigma^2}\Big) \,\Big|\, \mathcal{F}_{k-1}\Big] \le \gamma \quad \forall k \ge 1.$$

Moreover, we consider the setting where we observe $(Y_1, \dots, Y_n)$ and $(X_0, \dots, X_{n-1})$,
namely the stopping-time $N$ is simply equal to $n$ (the results in this section are proved
for $n$ large enough). Note that in this setting, we have $L(h) = \sigma^{-2}\sum_{k=1}^n \mathbf{1}_{|X_{k-1} - x| \le h}$.
We assume also that $(X_k)_{k \ge 0}$ is a strictly stationary sequence.



4.1 Some preliminaries 

A function $\ell : \mathbb{R}_+ \to \mathbb{R}_+$ is slowly varying if it is continuous and if

$$\lim_{h \to 0^+} \ell(yh)/\ell(h) = 1, \quad \forall y > 0.$$

Fix $\tau \in \mathbb{R}$. A function $g : \mathbb{R}_+ \to \mathbb{R}_+$ is $\tau$-regularly varying if $g(y) = y^\tau \ell(y)$ for some
slowly varying $\ell$. Regular variation is a standard and useful notion, of importance in
extreme values theory for instance. We refer to [5] on this topic.

Below we will use the notion of $\beta$-mixing to measure the dependence of the
sequence $(X_k)_{k \ge 0}$. This measure of dependence was introduced by Kolmogorov, see
[19], and we refer to [9] for topics on dependence. Introduce the $\sigma$-field $\mathcal{X}_u^v = \sigma(X_k :
u \le k \le v)$, where $u, k, v$ are integers. A strictly stationary process $(X_k)_{k \in \mathbb{Z}}$ is called
$\beta$-mixing or absolutely regular if

$$\beta_q := \frac{1}{2} \sup\Big(\sum_{i=1}^I \sum_{j=1}^J |\mathbb{P}[U_i \cap V_j] - \mathbb{P}[U_i]\mathbb{P}[V_j]|\Big) \to 0 \quad \text{as } q \to +\infty, \tag{16}$$

where the supremum is taken among all finite partitions $(U_i)_{i=1}^I$ and $(V_j)_{j=1}^J$ of $\Omega$
that are, respectively, $\mathcal{X}_{-\infty}^0$ and $\mathcal{X}_q^{+\infty}$ measurable. This notion of dependence is
convenient in statistics because of a coupling result by Berbee, see [3], that allows one to
construct, among $\beta$-mixing observations, independent blocks, on which one can use
Bernstein's or Talagrand's inequality (for a supremum) for instance. This strategy has
been adopted in a series of papers dealing with dependent data, see [32, 2, 29] among
others. In this section, we use this approach to give a deterministic equivalent to the
random rate used in Section 2. This allows us to prove that Theorem 1 is consistent
with the usual minimax theory of deterministic rates, when one assumes that the
sequence $(X_k)_{k \ge 0}$ is $\beta$-mixing.



4.2 Deterministic rates 

We assume that $f$ has Hölder-type smoothness in a neighbourhood of $x$. Let us fix
two constants $\delta_0, u_0 > 0$ and recall that $h_0$ is the maximum bandwidth used in
Lepski's procedure (see Section 1.3).

Assumption 4 (Smoothness of $f$). There is $0 < s \le 1$ and a slowly varying function
$\ell_w$ such that the following holds:

$$\sup_{y : |y - x| \le h} |f(y) - f(x)| \le w(h), \quad \text{where } w(h) := h^s\ell_w(h)$$

for any $h \le h_0$, $w$ is increasing on $[0, h_0]$, $w(h) \ge \delta_0(h/h_0)^2$ and $w(h) \le u_0$ for any
$h \in [0, h_0]$.

This is slightly more general than a Hölder assumption because of the slowly
varying term $\ell_w$. The usual Hölder assumption is recovered by taking $\ell_w \equiv r$, where
$r > 0$ is some constant (the radius in Hölder smoothness).

Under Assumption 4, one has that

$$\sup_{h' \in [H_{u_0}, h] \cap \mathcal{H}} |\tilde f(h') - f(x)| \le w(h) \quad \forall h \in \mathcal{H}.$$

Under this assumption, one can replace $W$ by $w$ in the statement of Theorem 1 and
in the definition of the oracle bandwidth $H^*$ (see (8)). An oracle bandwidth related
to the modulus of continuity $w$ can be defined in the following way: on the event

$$\Omega_0 = \{L(h_0)^{-1/2} \le w(h_0)\},$$

let us define

$$H_w := \min\{h \in ]0, h_0] : (\psi(h)/L(h))^{1/2} \le w(h)\}. \tag{17}$$

Under some ergodicity condition (using $\beta$-mixing) on $(X_k)_{k \ge 0}$, we are able to give
a deterministic equivalent to $w(H_w)$. Indeed, in this situation, the occupation time
$L(h)$ concentrates around its expectation $\mathbb{E}L(h)$, so a natural deterministic equivalent
to (17) is given by

$$h_w := \min\{h \in ]0, h_0] : (\psi(h)/\mathbb{E}L(h))^{1/2} \le w(h)\}. \tag{18}$$

Note that $h_w$ is well defined and unique when $(\mathbb{E}L(h_0))^{-1/2} \le w(h_0)$, ie when $n \ge
\sigma^2/(P_X([x - h_0, x + h_0])w(h_0)^2)$, where $P_X$ stands for the distribution of $X_0$. We are
able to give the behaviour of $h_w$ under the following assumption.

Assumption 5 (Local behaviour of $P_X$). There is $\tau \ge -1$ and a slowly varying
function $\ell_X$ such that

$$P_X([x - h, x + h]) = h^{\tau + 1}\ell_X(h) \quad \forall h \le h_0.$$

This is an extension of the usual assumption on $P_X$ which requires that it has a
continuous density $f_X$ wrt the Lebesgue measure such that $f_X(x) > 0$ (see also [12]).
It is met when $f_X(y) = c|y - x|^\tau$ for $y$ close to $x$ for instance (in this case $\ell_X$ is
constant).

Lemma 1. Grant Assumptions 4 and 5. Then $h_w$ is well defined by (18) and unique
when $n$ is large enough and such that

$$h_w = (\sigma^2/n)^{1/(2s + \tau + 1)}\ell_1(\sigma^2/n) \quad \text{and} \quad w(h_w) = (\sigma^2/n)^{s/(2s + \tau + 1)}\ell_2(\sigma^2/n),$$

where $\ell_1$ and $\ell_2$ are slowly varying functions that depend on $s, \tau$ and $\ell_X, \ell_w$.

The proof of this lemma easily follows from basic properties of regularly varying
functions, so it is omitted. Explicit examples of such rates are given in [12]. Note
that in the i.i.d. regression setting, we know from [12] that $w(h_w)$ is the minimax
adaptive rate of convergence. Now, under the following mixing assumption, we can
prove that the random rate $w(H_w)$ and the deterministic rate $w(h_w)$ have the same
order of magnitude with a large probability.

Assumption 6. Let $(\beta_q)_{q \ge 1}$ be the sequence of $\beta$-mixing coefficients of $(X_k)_{k \ge 0}$,
see (16), and let $\eta, \kappa > 0$. We assume that for any $q \ge 1$:

where $\psi(u) = \eta(\log u)^\kappa$ (geometric mixing) or $\psi(u) = \eta u^\kappa$ (arithmetic mixing).

Proposition 1. Let Assumptions 4, 5 and 6 hold. On $\Omega_0$, let $H_w$ be given by (17)
and let (for $n$ large enough) $h_w$ be given by (18). Then, if $(X_k)$ is geometrically
$\beta$-mixing, or if it is arithmetically $\beta$-mixing with a constant $\kappa < 2s/(\tau + 1)$, we have

$$\mathbb{P}\Big[\Big\{\frac{w(h_w)}{4} \le w(H_w) \le 4w(h_w)\Big\} \cap \Omega_0\Big] \ge 1 - \varphi_n \quad \text{and} \quad \mathbb{P}[\Omega_0^c] = o(\varphi_n)$$

for $n$ large enough, where in the geometrically $\beta$-mixing case:

$$\varphi_n = \exp(-C_1 n^{\delta_1}\ell_1(1/n)) \quad \text{where} \quad \delta_1 = \frac{2s}{(2s + \tau + 1)(\kappa + 1)}$$

and in the arithmetically $\beta$-mixing case:

$$\varphi_n = C_2 n^{-\delta_2}\ell_2(1/n) \quad \text{where} \quad \delta_2 = \frac{2s}{2s + \tau + 1}\Big(\frac{1}{\kappa} - \frac{\tau + 1}{2s}\Big),$$

where $C_1, C_2$ are positive constants and $\ell_1, \ell_2$ are slowly varying functions that
depend on $\eta, \kappa, \tau, s, \sigma$ and $\ell_X, \ell_w$.

The proof of Proposition 1 is given in Section 5 below. The assumption used in
Proposition 1 allows a geometric $\beta$-mixing, or an arithmetic $\beta$-mixing, up to a certain
order, for the sequence $(X_k)$. This kind of restriction on the coefficient of arithmetic
mixing is standard, see for instance [29, 32, 2].

The next result is a direct corollary of Theorem 1 and Proposition 1. It says that
when $(X_k)_{k \ge 0}$ is mixing, then the deterministic rate $w(h_w)$ is an upper bound for the
risk of $\hat f(\hat H)$.

Corollary 1. Let Assumptions 3, 4 and 5 hold. Let Assumption 6 hold, with the
extra assumption that $\kappa < 2s/(s + \tau + 1)$ in the arithmetical $\beta$-mixing case. Moreover,
assume that $|f(x)| \le Q$ for some known constant $Q > 0$. Let us fix $p > 0$. If $\nu > 0$
satisfies $b\mu\nu^2 > 128p(1 + \gamma)$ (recall that $\nu$ is the constant in front of the threshold in
Lepski's procedure, see (4)) then we have

$$\mathbb{E}[|\bar f(\hat H) - f(x)|^p] \le C_1 w(h_w)^p$$

for $n$ large enough, where $\bar f(\hat H) = -Q \vee \hat f(\hat H) \wedge Q$ and where $C_1 > 0$ depends on
$q, p, s, \mu, \gamma, b, u_0, \delta_0, \nu, Q$.

The proof of Corollary 1 is given in Section 5 below. Let us recall that in the i.i.d.
regression model with Gaussian noise, we know from [12] that $w(h_w)$ is the minimax
adaptive rate of convergence. So, Corollary 1 proves that Theorem 1 is consistent
with the minimax theory of deterministic rates, when $(X_k)$ is $\beta$-mixing.

Example 3. Assume that $f$ is $s$-Hölder, ie Assumption 4 holds with $w(h) = Lh^s$ so
$\ell_w(h) = L$ and assume that $P_X$ has a density $f_X$ which is continuous and bounded
away from zero on $[x - h_0, x + h_0]$, so that Assumption 5 is satisfied with $\tau = 0$.
In this setting, one easily obtains that $w(h_w)$ is equal (up to some constant) to
$(\log n/n)^{s/(2s+1)}$, which is the pointwise minimax adaptive rate of convergence, see
[25, 23, 24] for the white-noise model and [12] for the regression model.
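
The order of magnitude claimed in Example 3 can be checked numerically. The Python sketch below solves the balance equation defining the deterministic bandwidth by bisection, under assumptions of our own: a Hölder modulus $r h^s$, an expected occupation time equal to $n\,\mathrm{mass}(h)/\sigma^2$ with $\mathrm{mass}(h) = h$ (uniform design on $[-1, 1]$ and $x = 0$), and hypothetical default values for $b$, $h_0$, $r$ and $\sigma$. The function name `deterministic_bandwidth` is ours.

```python
import math


def deterministic_bandwidth(n, s, r=1.0, sigma=1.0, b=1.0, h0=1.0, mass=lambda h: h):
    """Smallest h in (0, h0] with sqrt(psi(h) / E L(h)) <= r * h^s, by bisection.

    Assumes E L(h) = n * mass(h) / sigma^2 with mass increasing, so that the
    left-hand side decreases in h while the right-hand side increases.
    Returns None when the inequality fails at h0 (sample size too small).
    """
    def gap(h):
        psi = 1.0 + b * math.log(h0 / h)
        expected_occupation = n * mass(h) / sigma ** 2
        return math.sqrt(psi / expected_occupation) - r * h ** s

    if gap(h0) > 0:
        return None
    lo, hi = 1e-12, h0  # gap(lo) > 0 >= gap(hi)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi


# Ratio of the computed rate to (log n / n)^(s / (2s + 1)) for two sample sizes.
s = 1.0
ratios = []
for n in (10 ** 4, 10 ** 6):
    h_w = deterministic_bandwidth(n, s)
    ratios.append(h_w ** s / (math.log(n) / n) ** (s / (2 * s + 1)))
```

The two ratios are close to each other, which is what one expects if the computed rate and $(\log n/n)^{s/(2s+1)}$ agree up to a constant.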
5 Proof of the main results

5.1 Proof of Theorem 2 for $\alpha = 2$

Let $a > 0$ and $\lambda \in [0, \frac{\mu}{2(1 + \gamma)})$. Define $Y_0 := 0$ and for $n \ge 1$:

$$Y_n := \frac{a M_n^2}{(a + V_n)^2} \quad \text{and} \quad H_n := \mathbb{E}[\exp(\lambda(Y_n - Y_{n-1})) \mid \mathcal{G}_{n-1}].$$

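Assume for the moment that $H_n$ is finite a.s., hence we can define the local martingale

$$S_n := \sum_{k=1}^n e^{\lambda Y_{k-1}}\big(e^{\lambda(Y_k - Y_{k-1})} - H_k\big),$$

so that

$$\exp(\lambda Y_n) = 1 + \sum_{k=1}^n e^{\lambda Y_{k-1}}\big(e^{\lambda(Y_k - Y_{k-1})} - 1\big) = 1 + S_n + \sum_{k=1}^n e^{\lambda Y_{k-1}}(H_k - 1).$$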

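Using the sequence of localizing stopping times

$$T_p := \min\Big\{n \ge 0 : \sum_{k=1}^{n+1} \mathbb{E}(e^{\lambda Y_k} \mid \mathcal{G}_{k-1}) > p\Big\}$$

for $p > 0$, the process $(S_{n \wedge T_p})_{n \ge 0}$ is a uniformly integrable martingale. So using
Fatou's Lemma, one easily gets that

$$\begin{aligned}
\mathbb{E}(e^{\lambda Y_T}) \le \liminf_{p \to +\infty} \mathbb{E}(e^{\lambda Y_{T \wedge T_p}}) &\le \liminf_{p \to +\infty}\Big(1 + \mathbb{E}(S_{T \wedge T_p}) + \mathbb{E}\Big(\sum_{k=1}^{T \wedge T_p} e^{\lambda Y_{k-1}}(H_k - 1)\Big)\Big) \\
&= 1 + \liminf_{p \to +\infty} \mathbb{E}\Big(\sum_{k=1}^{T \wedge T_p} e^{\lambda Y_{k-1}}(H_k - 1)\Big).
\end{aligned}$$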

This entails (14) if we prove that

$$\sum_{k=1}^n e^{\lambda Y_{k-1}}(H_k - 1) \le c_\lambda \tag{19}$$

for all $n \ge 1$. First, we prove that

$$H_n \le \exp\Big[\frac{\lambda a s_{n-1}^2}{(a + V_n)^2}\Big(\Gamma_\lambda + \frac{2M_{n-1}^2}{a + V_{n-1}}(2\lambda\Gamma_\lambda - 1)\Big)\Big], \tag{20}$$

which entails that $H_n$ is finite almost surely. We can write

$$\begin{aligned}
Y_n - Y_{n-1} &= a\,\frac{M_n^2 - M_{n-1}^2}{(a + V_n)^2} + a M_{n-1}^2\,\frac{(a + V_{n-1})^2 - (a + V_n)^2}{(a + V_n)^2 (a + V_{n-1})^2} \\
&= a\,\frac{(M_n - M_{n-1})^2 + 2M_{n-1}(M_n - M_{n-1})}{(a + V_n)^2} - \frac{a M_{n-1}^2 s_{n-1}^2 (2a + V_{n-1} + V_n)}{(a + V_n)^2 (a + V_{n-1})^2} \\
&\le \frac{a(s_{n-1}^2\zeta_n^2 + 2M_{n-1}s_{n-1}\zeta_n)}{(a + V_n)^2} - \frac{2a M_{n-1}^2 s_{n-1}^2}{(a + V_n)^2 (a + V_{n-1})},
\end{aligned}$$

where we used that $V_{n-1} \le V_n$. In other words

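$$\exp(\lambda(Y_n - Y_{n-1})) \le \exp(\mu_n\zeta_n^2 + \rho_n\zeta_n - \delta_n)$$

with:

$$\mu_n = \frac{\lambda a s_{n-1}^2}{(a + V_n)^2}, \quad \rho_n = \frac{2\lambda a s_{n-1} M_{n-1}}{(a + V_n)^2}, \quad \delta_n = \frac{2\lambda a s_{n-1}^2 M_{n-1}^2}{(a + V_n)^2 (a + V_{n-1})}.$$

The random variables $\mu_n$, $\rho_n$ and $\delta_n$ are $\mathcal{G}_{n-1}$-measurable and one has $0 \le \mu_n \le \lambda$.
We need the following Lemma.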

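Lemma 2. Let $\zeta$ be a real random variable such that $\mathbb{E}[\zeta] = 0$ and such that

$$\mathbb{E}[\exp(\mu\zeta^2)] \le \gamma$$

for some $\mu > 0$ and $\gamma > 1$. Then, for any $\rho \in \mathbb{R}$ and $m \in [0, \mu)$, we have

$$\mathbb{E}[e^{m\zeta^2 + \rho\zeta}] \le \exp\Big(\frac{(1 + 2\gamma)(\rho^2 + m)}{2(\mu - m)}\Big).$$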

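The proof of this Lemma is given in Section 6. Conditionally to $\mathcal{G}_{n-1}$, we apply
Lemma 2 to $\zeta_n$. This gives

$$H_n \le \mathbb{E}[\exp(\mu_n\zeta_n^2 + \rho_n\zeta_n - \delta_n) \mid \mathcal{G}_{n-1}] \le \exp(\Gamma_\lambda(\rho_n^2 + \mu_n) - \delta_n),$$

that can be written

$$H_n \le \exp\Big[\frac{\lambda a s_{n-1}^2}{(a + V_n)^2}\Big(\Gamma_\lambda + 2M_{n-1}^2\Big(\frac{2\lambda\Gamma_\lambda a}{(a + V_n)^2} - \frac{1}{a + V_{n-1}}\Big)\Big)\Big],$$

which yields (20) using $a/(a + V_n)^2 \le 1/(a + V_{n-1})$. Since $\lambda < \mu/[2(1 + \gamma)]$, we have
$2\lambda\Gamma_\lambda - 1 < 0$, so (20) entails

$$H_n - 1 \le \exp\Big[\frac{\lambda\Gamma_\lambda a s_{n-1}^2}{(a + V_n)^2}\Big] - 1 \le (\exp(\lambda\Gamma_\lambda) - 1)\frac{a s_{n-1}^2}{(a + V_n)^2},$$

where we used the fact that $e^{\mu x} - 1 \le (e^\mu - 1)x$ for any $x \in [0, 1/2]$ and $\mu > 0$. Note
that (20) entails also the following inclusion: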



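$$\{H_n > 1\} \subset \Big\{\frac{2M_{n-1}^2}{a + V_{n-1}} < \frac{\Gamma_\lambda}{1 - 2\lambda\Gamma_\lambda}\Big\} \subset \Big\{e^{\lambda Y_{n-1}} \le \exp\Big[\frac{\lambda\Gamma_\lambda}{2(1 - 2\lambda\Gamma_\lambda)}\Big]\Big\}.$$

It follows that

$$\sum_{k=1}^n e^{\lambda Y_{k-1}}(H_k - 1) \le c_\lambda \sum_{k=1}^n \frac{a s_{k-1}^2}{(a + V_k)^2},$$

so (19) follows, since

$$\sum_{k=1}^n \frac{a s_{k-1}^2}{(a + V_k)^2} \le \int_0^{V_n} \frac{a}{(a + x)^2}\,dx \le 1.$$

This concludes the proof of (14) for $\alpha = 2$. □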
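
The last step of the proof relies on an elementary comparison between a sum and an integral, which holds for any real sequence of scales. The short Python check below illustrates it; the function name `normalized_increments` and the random test sequence are our own choices. It returns the sum of $a s_{k-1}^2/(a + V_k)^2$ together with the value $V_n/(a + V_n)$ of the integral of $a/(a + x)^2$ over $[0, V_n]$.

```python
import random


def normalized_increments(s_values, a):
    """Return (sum_k a * s_{k-1}^2 / (a + V_k)^2, V_n / (a + V_n)).

    V_k is the running sum of squares of s_values; the second component is
    the closed form of the integral of a / (a + x)^2 over [0, V_n].
    """
    V = 0.0
    total = 0.0
    for s in s_values:
        V += s * s
        total += a * s * s / (a + V) ** 2
    return total, V / (a + V)


rng = random.Random(0)
s_values = [rng.uniform(-5.0, 5.0) for _ in range(1000)]
total, integral = normalized_increments(s_values, a=1.0)
```

Whatever the scales, the sum stays below the integral, which is itself smaller than one.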
5.2 Proof of Theorem 2 for a = 1 

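First, note that (13) and the fact that the $\zeta_k$ are centered entail that for any $|\lambda| < \mu$,
we have

\[ \mathbb{E}[\exp(\lambda\zeta_k) \mid \mathcal{G}_{k-1}] \le \exp(\mu'\lambda^2) \tag{21} \]

for any $k \ge 1$, where $\mu' = (\gamma - 1)/\mu^2$. Now, we use the same mechanism of proof as
for the case $\alpha = 2$. Let $a > 0$ and $\lambda \in (-\mu, \mu)$ be fixed. Define

\[ Y_n = \frac{\sqrt{a}\,M_n}{a + V_n} \quad\text{and}\quad H_n = \mathbb{E}[\cosh(\lambda Y_n) - \cosh(\lambda Y_{n-1}) \mid \mathcal{G}_{n-1}]. \]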

Assuming for the moment that $H_n$ is finite almost surely, we define the local martingale

\[ S_n := \sum_{k=1}^{n}\big(\cosh(\lambda Y_k) - \cosh(\lambda Y_{k-1}) - H_k\big). \]

Thus, inequality (15) follows if we prove that for all $n \ge 1$:

\[ \cosh(\lambda Y_n) \le 1 + S_n + \mu'\lambda^2 \exp(\mu'\lambda^2)\cosh(2\log 2 + 2\mu'\lambda^2). \]
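We can write

\[ Y_n - Y_{n-1} = -\frac{\sqrt{a}\,M_{n-1}s_{n-1}^2}{(a+V_n)(a+V_{n-1})} + \frac{\sqrt{a}\,s_{n-1}\zeta_n}{a+V_n}, \]

which gives, together with (21):

\[ \mathbb{E}\big[\exp(\pm\lambda(Y_n - Y_{n-1})) \mid \mathcal{G}_{n-1}\big] \le \exp\Big(\mp\frac{\lambda\sqrt{a}\,M_{n-1}s_{n-1}^2}{(a+V_n)(a+V_{n-1})} + \frac{\mu'\lambda^2 a s_{n-1}^2}{(a+V_n)^2}\Big). \]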
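As we have

\[ \cosh(\lambda Y_n) = \tfrac12 e^{\lambda Y_{n-1}}e^{\lambda(Y_n - Y_{n-1})} + \tfrac12 e^{-\lambda Y_{n-1}}e^{-\lambda(Y_n - Y_{n-1})}, \]

we derive:

\[
\begin{aligned}
\mathbb{E}\big[\cosh(\lambda Y_n) \mid \mathcal{G}_{n-1}\big]
&\le \tfrac12\exp\Big(\lambda Y_{n-1} - \frac{\lambda\sqrt{a}\,M_{n-1}s_{n-1}^2}{(a+V_n)(a+V_{n-1})} + \frac{\mu'\lambda^2 a s_{n-1}^2}{(a+V_n)^2}\Big) \\
&\quad + \tfrac12\exp\Big(-\lambda Y_{n-1} + \frac{\lambda\sqrt{a}\,M_{n-1}s_{n-1}^2}{(a+V_n)(a+V_{n-1})} + \frac{\mu'\lambda^2 a s_{n-1}^2}{(a+V_n)^2}\Big) \\
&= \exp\Big(\frac{\mu'\lambda^2 a s_{n-1}^2}{(a+V_n)^2}\Big)\cosh\Big(\Big(1 - \frac{s_{n-1}^2}{a+V_n}\Big)\lambda Y_{n-1}\Big).
\end{aligned}
\]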
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So, it remains to prove that 

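\[ \sum_{k=1}^{n}\Big(\exp\Big(\frac{\mu'\lambda^2 a s_{k-1}^2}{(a+V_k)^2}\Big)\cosh\Big(\Big(1 - \frac{s_{k-1}^2}{a+V_k}\Big)\lambda Y_{k-1}\Big) - \cosh(\lambda Y_{k-1})\Big) \le \mu'\lambda^2\exp(\mu'\lambda^2)\cosh(2\log 2 + 2\mu'\lambda^2). \]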

We need the following lemma. 
Lemma 3. If $A > 0$, one has, for any $\eta \in [0, 1]$,

\[ \sup_{z \ge 0}\big(e^{A\eta}\cosh((1-\eta)z) - \cosh(z)\big) \le A\eta\,e^{A\eta}\cosh(2\log 2 + 2A). \]
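
The inequality of Lemma 3 can be probed numerically on a grid. The following is a minimal sketch under stated assumptions: the function names, the grid sizes and the truncation `z_max` are illustrative choices, not part of the paper, and a grid check is of course weaker than the proof in Section 6.

```python
import math


def lemma3_gap(A, eta, z):
    # Right-hand side minus left-hand side of Lemma 3 at the point (eta, z).
    lhs = math.exp(A * eta) * math.cosh((1.0 - eta) * z) - math.cosh(z)
    rhs = A * eta * math.exp(A * eta) * math.cosh(2.0 * math.log(2.0) + 2.0 * A)
    return rhs - lhs


def min_gap(A, n_eta=50, n_z=400, z_max=20.0):
    # Smallest gap over a regular grid of eta in [0, 1] and z in [0, z_max].
    return min(
        lemma3_gap(A, i / n_eta, z_max * j / n_z)
        for i in range(n_eta + 1)
        for j in range(n_z + 1)
    )
```

A non-negative value of `min_gap(A)` (up to rounding) is what the Lemma predicts; equality is attained at $\eta = 0$, where both sides vanish.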

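The proof of this Lemma is given in Section 6. Using Lemma 3 with $\eta = s_{k-1}^2/(a+V_k)$ and $A = \mu'\lambda^2 a/(a+V_k)$, we obtain

\[ \exp\Big(\frac{\mu'\lambda^2 a s_{k-1}^2}{(a+V_k)^2}\Big)\cosh\Big(\Big(1 - \frac{s_{k-1}^2}{a+V_k}\Big)\lambda Y_{k-1}\Big) - \cosh(\lambda Y_{k-1}) \le \frac{\mu'\lambda^2 a s_{k-1}^2}{(a+V_k)^2}\,e^{\mu'\lambda^2}\cosh(2\log 2 + 2\mu'\lambda^2), \]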



and (15) follows, since

\[ \sum_{k=1}^{n}\frac{a s_{k-1}^2}{(a+V_k)^2} \le \int_0^{\infty}\frac{a}{(a+x)^2}\,dx \le 1. \]

This concludes the proof of Theorem 2. □

5.3 Proof of Theorem 1 
5.3.1 Notations 

Let us fix $\lambda \in \big(0, \frac{\mu}{2(1+\gamma)}\big)$, to be chosen later. In the following we denote by $C$ any
constant which depends only on $(\lambda, \mu, \gamma)$. Let us recall that on the event

\[ \Omega' := \{L(h_0)^{-1/2} \le W(h_0)\} \cap \{W(H^*) \le u_0\}, \]

the bandwidths $H^*$ and $\hat H$ are well defined, and let us set for short

\[ \mathbb{P}'(A) = \mathbb{P}(\Omega' \cap A). \]

We use the following notations: for $h > 0$ and $a > 0$, take

\[ M(h) := \sum_{k=1}^{N}\frac{1}{\sigma_{k-1}^2}\,1_{|X_{k-1}-x| \le h}\,\varepsilon_k, \qquad Z(h, a) := \frac{\sqrt{a}\,|M(h)|}{a + L(h)}. \tag{22} \]

If $h = h_j \in \mathcal{H}$, we denote $h_- := h_{j+1}$ and $h_+ := h_{j-1}$ if $j \ge 1$. We will use repeatedly
the following quantity: for $i_0 \in \mathbb{N}$ and $t > 0$, consider

\[ \pi(i_0, t) := \mathbb{P}\Big[\sup_{i \ge i_0}\psi^{-1/2}(h_i)\sup_{a \in I(h_i)} Z(h_i, a\psi(h_i)) > t\Big], \tag{23} \]

where

\[ I(h) := \big[u_0^{-2},\ \delta_0^{-2}(h/h_0)^{-2\alpha_0}\big]. \]

Note that this interval is related to the definition of $W$, see (7). The proof of Theorem 1
contains three main steps. Namely,

1. the study of the risk of the ideal estimator $W(H^*)^{-1}|\hat f(H^*) - f(x)|$,

2. the study of the risk $W(H^*)^{-1}|\hat f(\hat H) - f(x)|$ when $\{H^* \le \hat H\}$,

3. the study of the risk $W(H^*)^{-1}|\hat f(\hat H) - f(x)|$ when $\{H^* > \hat H\}$.

These are the usual steps in the study of Lepski's method, see [22, 23, 24, 25].
However, the context (and consequently the proof) proposed here differs significantly
from the "usual" proof.

5.3.2 On the event $\{H^* \le \hat H\}$

Recall that $\nu > 0$ is the constant in front of the Lepski's threshold, see (4). Let us
prove the following.

Lemma 4. For all $t > 0$ one has

\[ \mathbb{P}'\big[W(H^*)^{-1}|\hat f(H^*) - f(x)| > t\big] \le \pi(0, (t-1)/2), \tag{24} \]

and

\[ \mathbb{P}'\big[H^* \le \hat H,\ W(H^*)^{-1}|\hat f(\hat H) - f(x)| > t\big] \le \pi(0, (t-\nu-1)/2). \tag{25} \]



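Proof. First, use the decomposition

\[ |\hat f(H^*) - f(x)| \le |\tilde f(H^*) - f(x)| + \frac{|M(H^*)|}{L(H^*)}, \]

where we recall that $\tilde f(h)$ is given by (5), and the fact that $|\tilde f(H^*) - f(x)| \le W(H^*)$,
since $\tilde W(H^*) \le W(H^*)$ on $\{W(H^*) \le u_0\}$. Then, use (8) to obtain $L(H^*)^{1/2} \ge \psi(H^*)^{1/2}W(H^*)^{-1}$, so that

\[ \frac{|M(H^*)|}{L(H^*)} \le \frac{2|M(H^*)|}{L(H^*) + \psi(H^*)W(H^*)^{-2}} \le 2W(H^*)\psi^{-1/2}(H^*)\,Z\big(H^*, W^{-2}(H^*)\psi(H^*)\big), \]

and

\[ W^{-1}(H^*)\frac{|M(H^*)|}{L(H^*)} \le 2\psi^{-1/2}(H^*)\sup_{a \in I(H^*)} Z(H^*, a\psi(H^*)) \le 2\sup_{j \ge 0}\psi^{-1/2}(h_j)\sup_{a \in I(h_j)} Z(h_j, a\psi(h_j)), \]

this concludes the proof of (24). On $\{H^* \le \hat H\}$, one has using (4) and (8):

\[ |\hat f(\hat H) - \hat f(H^*)| \le \nu\big(\psi(H^*)/L(H^*)\big)^{1/2} \le \nu W(H^*). \]

Hence, since $\tilde W(H^*) \le W(H^*)$ on $\{W(H^*) \le u_0\}$, we have for all $t > 0$:

\[ \mathbb{P}'\big[H^* \le \hat H,\ W(H^*)^{-1}|\hat f(\hat H) - f(x)| > t\big] \le \mathbb{P}'\big[H^* \le \hat H,\ W(H^*)^{-1}|\hat f(H^*) - f(x)| > t - \nu\big], \tag{26} \]

and (25) follows using (24).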



□

5.3.3 On the event $\{H^* > \hat H\}$

Lemma 5. For any $t, \eta > 0$, we have

\[ \mathbb{P}'\Big[H^* \le \eta,\ \sup_{H_{u_0} \le h < H^*,\ h \in \mathcal{H}}\frac{|M(h)|}{(L(h)\psi(h))^{1/2}} > t\Big] \le \pi(i_0(\eta), t/2), \]

where we put

\[ i_0(\eta) = \min\{i \in \mathbb{N} : h_i \le \eta\}. \]



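Proof. Note that $u(h) := (\psi(h)/L(h))^{1/2}$ is decreasing, so $h = H_{u(h)}$ for $h \in \mathcal{H}$, and
note that

\[ \frac{|M(h)|}{(L(h)\psi(h))^{1/2}} = u(h)^{-1}\frac{|M(H_{u(h)})|}{L(H_{u(h)})}. \]

If $h < H^*$ then $u(h) = (\psi(h)/L(h))^{1/2} \ge W(h)$ using (8), and $W(h) \ge \delta_0(h/h_0)^{\alpha_0}$.
So, $u(h) \ge \delta_0(H_{u(h)}/h_0)^{\alpha_0}$ when $h < H^*$. If $h \ge H_{u_0}$ then $u(h) \le u_0$ using the
definition of $H_{u_0}$. This entails

\[ \sup_{H_{u_0} \le h < H^*,\ h \in \mathcal{H}}\frac{|M(h)|}{(L(h)\psi(h))^{1/2}} \le \sup\Big\{u^{-1}\frac{|M(H_u)|}{L(H_u)} \;:\; H_u < H^* \text{ and } \delta_0(H_u/h_0)^{\alpha_0} \le u \le u_0\Big\}. \]

Hence, for any $u$ such that $\delta_0(H_u/h_0)^{\alpha_0} \le u \le u_0$ and $H_u < H^* \le \eta$, one has
using (3):

\[
\begin{aligned}
u^{-1}\frac{|M(H_u)|}{L(H_u)} &\le 2u^{-1}\frac{|M(H_u)|}{L(H_u) + u^{-2}\psi(H_u)} \\
&= 2\psi(H_u)^{-1/2}\,Z\big(H_u, u^{-2}\psi(H_u)\big) \\
&\le 2\sup_{i : h_i \le \eta}\psi(h_i)^{-1/2}\sup_{\delta_0(h_i/h_0)^{\alpha_0} \le u \le u_0} Z\big(h_i, u^{-2}\psi(h_i)\big).
\end{aligned}
\]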



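□

Lemma 6. For any $s, t > 0$ define

\[ \eta_{s,t} := h_0\Big(\frac{s\,u_0}{t\,\delta_0}\Big)^{1/\alpha_0}. \tag{27} \]

Then, for all $0 < s < t$, we have:

\[ \mathbb{P}'\big[H^* > \hat H,\ W(H^*)^{-1}|\hat f(\hat H) - f(x)| > t\big] \le \pi\Big(0, \frac{s-1}{2}\Big) + \pi\Big(i_0(\eta_{s,t}), \frac14\Big(\nu - \frac{2s}{t}\Big)\Big) + \pi\Big(0, \frac12\Big(\frac{\nu t}{2s} - 1\Big)\Big). \]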



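Proof. Let $0 < s < t$. One has

\[
\begin{aligned}
\mathbb{P}'\big[H^* > \hat H,\ W(H^*)^{-1}|\hat f(\hat H) - f(x)| > t\big]
&\le \mathbb{P}'\big[H^* > \hat H,\ (L(\hat H)/\psi(\hat H))^{1/2}|\hat f(\hat H) - f(x)| > s\big] \\
&\quad + \mathbb{P}'\big[H^* > \hat H,\ (\psi(\hat H)/L(\hat H))^{1/2} > (t/s)W(H^*)\big].
\end{aligned}
\]

The first term is less than $\pi(0, (s-1)/2)$. Indeed, on $\{W(H^*) \le u_0, H^* > \hat H\}$ one
has

\[
\begin{aligned}
(L(\hat H)/\psi(\hat H))^{1/2}|\hat f(\hat H) - f(x)| &\le (L(\hat H)/\psi(\hat H))^{1/2}|\tilde f(\hat H) - f(x)| + (L(\hat H)\psi(\hat H))^{-1/2}|M(\hat H)| \\
&\le (L(\hat H)/\psi(\hat H))^{1/2}W(\hat H) + (L(\hat H)\psi(\hat H))^{-1/2}|M(\hat H)| \\
&\le 1 + (L(\hat H)\psi(\hat H))^{-1/2}|M(\hat H)|,
\end{aligned}
\]

and the desired upper bound follows from Lemma 5. Let us bound the second term.
Consider

\[ \omega \in \big\{W(H^*) \le u_0,\ H^* > \hat H,\ (\psi(\hat H)/L(\hat H))^{1/2} > (t/s)W(H^*)\big\}. \]

Due to the definition of $\hat H$, see (4), there exists $h' = h'_\omega \in [H_{u_0}, \hat H]$ such that

\[ |\hat f(h') - \hat f(\hat H_+)| > \nu\big(\psi(h')/L(h')\big)^{1/2}. \]

But since $h' \le \hat H < H^*$, one has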

'V^'h 1 / 2 trir m^|» l m 7,* M , W)l , \M(H+)\ 



< 2W{H 



L(h')J ^ ' '^>\-\" > ;v +/| ■ L(ft/) 

|M(/i')| , |M(£+)| 



< 



< 



2s fiP(H)\i/i \M(h')\ \M(H+)\ 



t \L(H)' L(h') L(H+) 

2s/i>(h')\V* , \M(h')\ , |M(£+)| 



So, since $h' \le \hat H$ entails (for such an $\omega$) that $(\psi(h')/L(h'))^{1/2} \ge (\psi(\hat H)/L(\hat H))^{1/2} > (t/s)W(H^*)$, we obtain

\M(h')\ | |A/(g + )| ^ / ggy^')^/ 



2s\ r /^(^') \ i/ 2 i. 



> 



and therefore 



f |M(/t)| 1/ _2£Xi 

In addition, because $\hat H \ge H_{u_0}$, one has

\[ \delta_0(H^*/h_0)^{\alpha_0} \le W(H^*) \le (s/t)\big(\psi(\hat H)/L(\hat H)\big)^{1/2} \le (s/t)u_0, \]

so $H^* \le \eta_{s,t}$, where $\eta_{s,t}$ is given by (27). We have shown that
{W(H*) < u ,H* > H, (^f|) 1/2 > -W(H*)} 
C <H < r) B<t , 



sup 



r \M(H*)\ fvt \ - . „. -i 



H UQ <h<H* ,hen 
vt _\ 

s 



(L(h)i>(h)Y /2 



2s 



)} 



L{H*) 

and we conclude using Lemma 5 and (26). □ 
5.3.4 Finalization of the proof 

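In order to conclude the proof of Theorem 1, we need the following uniform version
of Theorem 2: under the same assumptions as in Theorem 2, we have for any $0 < a_0 < a_1$:

\[ \mathbb{E}\Big[\sup_{a \in [a_0, a_1]}\exp\Big(\frac{\lambda}{2}\,\frac{a M_N^2}{(a+V_N)^2}\Big)\Big] \le (1 + c_\lambda)\big(1 + \log(a_1/a_0)\big). \tag{28} \]

Indeed, since

\[ \frac{\partial}{\partial a}\,\frac{a M_N^2}{(a+V_N)^2} = \frac{M_N^2\,(V_N - a)}{(a+V_N)^3} \le \frac{M_N^2}{(a+V_N)^2}, \]

we have, writing $Y_a := a M_N^2/(a+V_N)^2$,

\[
\begin{aligned}
\sup_{a \in [a_0, a_1]}\exp(\lambda Y_a/2) &\le \exp(\lambda Y_{a_0}/2) + \int_{a_0}^{a_1} a^{-1}\exp(\lambda Y_a/2)\,\lambda Y_a/2\,da \\
&\le \exp(\lambda Y_{a_0}) + \int_{a_0}^{a_1} a^{-1}\exp(\lambda Y_a)\,da,
\end{aligned}
\]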

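so (28) follows taking the expectation and using Theorem 2. Now, using (28) with

\[ s_{k-1} = \frac{1}{\sigma_{k-1}}\,1_{|X_{k-1}-x| \le h}, \qquad \zeta_k = \varepsilon_k/\sigma_{k-1}, \]

we obtain

\[ \mathbb{E}\Big[\exp\Big((\lambda/2)\sup_{a \in [a_0, a_1]} Z(h, a)^2\Big)\Big] \le C\big(1 + \log(a_1/a_0)\big), \]

where we recall that $Z(h, a)$ is given by (22). So, using Markov's inequality, we arrive,
for all $h > 0$, $a_1 > a_0 > 0$ and $t > 0$, at:

\[ \mathbb{P}\Big[\sup_{a \in [a_0, a_1]} Z(h, a) > t\Big] \le C\big(1 + \log(a_1/a_0)\big)e^{-\lambda t^2/2}. \tag{29} \]

A consequence of (29), together with a union bound, is that for all $i_0 \in \mathbb{N}$ and $t > 0$:

\[ \pi(i_0, t) \le C e^{-\lambda t^2/2}\sum_{i \ge i_0}(h_i/h_0)^{b\lambda t^2/2}\big(1 + 2\log(u_0/\delta_0) + 2\alpha_0\log(h_0/h_i)\big), \tag{30} \]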



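where we recall that $\pi(i_0, t)$ is given by (23).
Now, it remains to use what the grid $\mathcal{H}$ is. Recall that for some $q \in (0, 1)$, we
have $h_i = h_0 q^i$ and we denote by $C$ any positive number which depends only on
$\lambda, \mu, \gamma, q, b, u_0, \delta_0, \alpha_0, \nu$. Using together (25) and Lemma 6, one gets for $0 < s < t$:

\[
\begin{aligned}
\mathbb{P}'\big[W(H^*)^{-1}|\hat f(\hat H) - f(x)| > t\big] &\le \pi\Big(0, \frac{t-\nu-1}{2}\Big) + \pi\Big(0, \frac{s-1}{2}\Big) \\
&\quad + \pi\Big(i_0(\eta_{s,t}), \frac14\Big(\nu - \frac{2s}{t}\Big)\Big) + \pi\Big(0, \frac12\Big(\frac{\nu t}{2s} - 1\Big)\Big),
\end{aligned}
\]

and using (30), we have for any $u > 0$, $i_0 \in \mathbb{N}$:

\[ \pi(i_0, u) \le C e^{-\lambda u^2/2}(i_0 + 1)\,q^{i_0 b\lambda u^2/2}. \]

Recalling that $\eta_{s,t}$ is given by (27) and that $i_0(\eta) = \min\{i \in \mathbb{N} : h_i \le \eta\}$, we have

\[ \frac{\log(\delta_0/u_0) + \log(t/s)}{\alpha_0\log(1/q)} \le i_0(\eta_{s,t}) \le \frac{\log(\delta_0/u_0) + \log(t/s)}{\alpha_0\log(1/q)} + 1. \tag{31} \]

Now, recall that $0 < \rho < \frac{b\mu\nu^2}{64\alpha_0(1+\gamma)}$ and consider $s = \sqrt{(8\rho\log t)/\lambda} + 1$. When $t$ is
large enough, we have $s < t$ and:

\[ \pi\Big(0, \frac{s-1}{2}\Big) \le C_1 t^{-\rho}, \qquad \pi\Big(0, \frac{t-\nu-1}{2}\Big) \le C_2\exp(-C_2' t^2), \qquad \pi\Big(0, \frac12\Big(\frac{\nu t}{2s} - 1\Big)\Big) \le C_3\exp\big(-C_3'(t/\log t)^2\big), \]

for constants $C_j, C_j'$ that depend on $\lambda, b, \nu, \delta_0, u_0, \alpha_0, q$. For the last probability, we
have:

\[ \pi\Big(i_0(\eta_{s,t}), \frac14\Big(\nu - \frac{2s}{t}\Big)\Big) \le C\exp\Big(-\frac{\lambda(\nu - 2s/t)^2}{32}\Big)\big(i_0(\eta_{s,t}) + 1\big)\exp\Big(-\frac{i_0(\eta_{s,t})\,b\lambda(\nu - 2s/t)^2\log(1/q)}{32}\Big), \]

and by taking $\lambda \in \big(0, \frac{\mu}{2(1+\gamma)}\big)$ and $t$ large enough, one has

\[ \frac{b\lambda(\nu - 2s/t)^2}{32\alpha_0} > \rho, \]

so we obtain together with (31):

\[ \pi\Big(i_0(\eta_{s,t}), \frac14\Big(\nu - \frac{2s}{t}\Big)\Big) \le C\,\frac{(\log(t+1))^{1+\rho/2}}{t^{\rho}} \]

when $t$ is large enough. This concludes the proof of Theorem 1. □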
5.4 Proof of Proposition 1 

Let us denote for short $I_h = [x - h, x + h]$. Recall that $h_w$ is well-defined when
$n \ge \sigma^2/(P_X[I_{h_0}]\,w(h_0)^2)$, and that $H_w$ is well defined on the event

\[ \Omega_0 = \{L(h_0) \ge w(h_0)^{-2}\}. \]

So, from now on, we suppose that $n$ is large enough, and we work on $\Omega_0$. We need
the following Lemma, which says that, when $L(h_w)$ and $\mathbb{E}L(h_w)$ are close, then $H_w$
and $h_w$ are close.
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Lemma 7. If Assumption 4 holds, we have for any < e < 1 that on Oq- 
{L(h w )>^^} d{H w <(l+e)h w } and 
{L(h w ) < C {H w > (1 - e)h w }, 

when n is large enough. 

The proof of Lemma 7 is given in Section 6 below. We use also the next Lemma
from [2] (see Claim 2, p. 858). It is a corollary of Berbee's coupling lemma [3], that
uses a construction from the proof of Proposition 5.1 in [32], see p. 484.

Lemma 8. Grant Assumption 6. Let $q, q_1$ be integers such that $0 \le q_1 \le q/2$, $q_1 \ge 1$.
Then, there exist random variables $(X_i^*)_{i=1}^{n}$ satisfying the following:

• For $j = 1, \ldots, J := [n/q]$, the random vectors

\[ U_{j,1} := (X_{(j-1)q+1}, \ldots, X_{(j-1)q+q_1}) \quad\text{and}\quad U_{j,1}^* := (X^*_{(j-1)q+1}, \ldots, X^*_{(j-1)q+q_1}) \]

have the same distribution, and so have the random vectors

\[ U_{j,2} := (X_{(j-1)q+q_1+1}, \ldots, X_{jq}) \quad\text{and}\quad U_{j,2}^* := (X^*_{(j-1)q+q_1+1}, \ldots, X^*_{jq}). \]

• For $j = 1, \ldots, J$,

\[ \mathbb{P}[U_{j,1} \ne U_{j,1}^*] \le \beta_{q-q_1} \quad\text{and}\quad \mathbb{P}[U_{j,2} \ne U_{j,2}^*] \le \beta_{q_1}. \]

• For each $k = 1, 2$, the random vectors $U_{1,k}^*, \ldots, U_{J,k}^*$ are independent.
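
The index blocks behind the vectors $U_{j,1}$ and $U_{j,2}$ of Lemma 8 can be sketched as follows. This is only an illustration of the bookkeeping: the function name `coupling_blocks` and the example values of `n`, `q`, `q1` are hypothetical, and the sketch assumes, as the text does to simplify, that $n = Jq$.

```python
def coupling_blocks(n, q, q1):
    """Split {1, ..., J*q}, J = n // q, into alternating blocks of sizes q1 and q - q1.

    Block j contributes the pair (I_{j,1}, I_{j,2}) with
    I_{j,1} = {(j-1)q + 1, ..., (j-1)q + q1} and
    I_{j,2} = {(j-1)q + q1 + 1, ..., jq}.
    """
    J = n // q
    blocks = []
    for j in range(1, J + 1):
        start = (j - 1) * q
        first = list(range(start + 1, start + q1 + 1))
        second = list(range(start + q1 + 1, j * q + 1))
        blocks.append((first, second))
    return blocks
```

For instance, `coupling_blocks(12, 4, 2)` yields three pairs of blocks that together partition the indices 1 to 12; the "odd" blocks index the vectors $U_{j,1}$ and the "even" blocks index the vectors $U_{j,2}$.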

In what follows, we take simply $q_1 = [q/2] + 1$, where $[x]$ stands for the integral
part of $x$, and introduce the event $\Omega^* = \{X_i = X_i^*, \forall i = 1, \ldots, n\}$. Assume to
simplify that $n = Jq$. Lemma 8 gives

\[ \mathbb{P}[(\Omega^*)^c] \le J(\beta_{q-q_1} + \beta_{q_1}) \le 2J\beta_{[q/2]} \le \frac{2n\beta_{[q/2]}}{q}. \tag{32} \]

Then, denote for short L*(h) — r i|<d, and note that, using Lemma 7, 

we have, for z := 1 — 1/(1 + e) s : 

{H w > (1 + e) s h w } nn*nf] C {L*{h w ) - EL(h w ) > zEL(h w )} 

1 " 

= {-J2^U-\<^ - p x[hJ) > zPxlhj}- 

i=l 

Use the following decomposition of the sum: 

1 n 1 3 

i=l 

where for $k \in \{1, 2\}$, we put

where $I_{j,1} := \{(j-1)q+1, \ldots, (j-1)q+q_1\}$ and $I_{j,2} := \{(j-1)q+q_1+1, \ldots, jq\}$. For
$k \in \{1, 2\}$, we have using Lemma 8 that the variables $(Z_{j,k})_{j=1}^{J}$ are independent, centered,
such that $\|Z_{j,k}\|_\infty \le 1/2$ and $\mathbb{E}[Z_{j,k}^2] \le P_X[I_{h_w}]/4$. So, Bernstein's inequality
gives



n n* n n 



< 2 exp 



nPx[Ih„ 



2(l + z/3) g 



and doing the same on the other side gives for any z G (0, 1) 



( 1 



(i + z y/s hw ~ Hw ~ "(l - z y 



^.} C n^]<4exp(-- 
So, when n is large enough, we have 

CnP x [I h , 



nP x [lK 



[h w /2 < H w < 2h w ] > 1 - 4 exp 



(1 + 2/3) q 
2n,(3[ q / 2 ] 



(33) 



But, since on $[0, h_0]$ $w$ is increasing and $w(h) = h^s\ell_w(h)$ where $\ell_w$ is slowly varying,
we have $\{h_w/2 \le H_w \le 2h_w\} \subset \{w(h_w)/4 \le w(H_w) \le 4w(h_w)\}$ when $n$ is large
enough. Now, Lemma 1 and Assumption 5 give that

\[ nP_X[I_{h_w}] = n^{2s/(2s+\tau+1)}\,\ell(1/n), \]

where $\ell$ is a slowly varying function that depends on $\ell_X$, $\ell_w$, $s$, $\tau$ and $\sigma$. When
the $\beta$-mixing is geometric, we have $\psi^{-1}(p) = \exp((p/\eta)^{1/\kappa})$, so the choice
$q = n^{2s\kappa/((2s+\tau+1)(\kappa+1))}$ implies

\[ \mathbb{P}\big[\{w(h_w)/4 \le w(H_w) \le 4w(h_w)\} \cap \Omega_0\big] \ge 1 - \exp\big(-C_1 n^{\delta_1}\ell_1(1/n)\big). \]

When the mixing is arithmetic, we have $\psi^{-1}(p) = (p/\eta)^{1/\kappa}$, so the choice
$q = n^{2s/(2s+\tau+1)}\ell(1/n)/(\log n)^2$ implies

\[ \mathbb{P}\big[\{w(h_w)/4 \le w(H_w) \le 4w(h_w)\} \cap \Omega_0\big] \ge 1 - C_2 n^{-\delta_2}\ell_2(1/n). \]

So, it only remains to control the probability of $\Omega_0^c$. Using the same coupling argument
as before together with Bernstein's inequality, we have when $n$ is large enough:



¥[L(h ) < w(h Q )- 2 } = W[L(h ) - EL{h ) < w(h ) 



EL(h )} 



< 



F(L(h )-EL(h )<-^M) 
) 



<exp(-C 2 nPA ' [// '" ] U 2 "^ /21 



So, when the $\beta$-mixing is geometric, the choice $q = n^{\kappa/(\kappa+1)}$ implies that $\mathbb{P}[\Omega_0^c] \le \exp(-C_1 n^{1/(\kappa+1)}) = o(\varphi_n)$. When the mixing is arithmetic, we have $\psi^{-1}(p) = (p/\eta)^{1/\kappa}$, so the choice $q = n/(\log n)^2$ gives $\mathbb{P}[\Omega_0^c] \le C_2(\log n)^2 n^{-1/\kappa} = o(\varphi_n)$. This
concludes the proof of Proposition 1. □
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5.5 Proof of Corollary 1 

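Let us fix $\rho \in \big(p, \frac{b\mu\nu^2}{128(1+\gamma)}\big)$ (note that $\alpha_0 = 2$ under Assumption 4). Using Assumption 4,
one can replace $W$ by $w$ in the statement of Theorem 1. This gives

\[ \mathbb{P}\big[\{|\hat f(\hat H) - f(x)| \ge t\,w(H^*)\} \cap \Omega_0\big] \le C_0\,\frac{(\log(t+1))^{\rho/2+1}}{t^{\rho}} \]

for any $t \ge t_0$, where we recall that $\Omega_0 = \{L(h_0)^{-1/2} \le w(h_0)\}$, and where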



H* := min 



Recall the definition (17) of $H_w$, and note that by construction of $\mathcal{H}$, one has that
$H_w \le H^* \le q^{-1}H_w$. So, on the event $\{H_w \le 2h_w\}$, one has, using the fact that $w$
is $s$-regularly varying, that $w(H^*) \le w(2q^{-1}h_w) \le 2(2/q)^s w(h_w)$ for $n$ large enough.
So, putting for short $A := \{H_w \le 2h_w\} \cap \Omega_0$, we have

\[ \mathbb{P}\big[\{|\hat f(\hat H) - f(x)| \ge c_1 t\,w(h_w)\} \cap A\big] \le C_0\,\frac{(\log(t+1))^{\rho/2+1}}{t^{\rho}} \]

for any $t \ge t_0$, where $c_1 = 2(2/q)^s$. Since $\rho > p$, we obtain, by integrating with
respect to $t$, that

\[ \mathbb{E}\big[|w(h_w)^{-1}(\hat f(\hat H) - f(x))|^p\,1_A\big] \le C_1, \]

where $C_1$ is a constant depending on $C_0, t_0, q, \rho, s, p$. Now, it only remains to observe
that using Proposition 1, $\mathbb{P}(A^c) \le 2\varphi_n$, and that $\varphi_n = o(w(h_w))$ in the geometrically
$\beta$-mixing case, and in the arithmetically $\beta$-mixing case when $\kappa < 2s/(s+\tau+1)$. □

6 Proof of the Lemmas 
6.1 Proof of Lemma 7 

For $n$ large enough, we have $\psi((1+\varepsilon)h_w)/\ell_w((1+\varepsilon)h_w)^2 \le (1+\varepsilon)^s\,\psi(h_w)/\ell_w(h_w)^2$
since $\psi/\ell_w^2$ is slowly varying. So,

\[ \frac{\psi((1+\varepsilon)h_w)}{w((1+\varepsilon)h_w)^2} \le \frac{1}{(1+\varepsilon)^s}\,\frac{\psi(h_w)}{w(h_w)^2} = \frac{1}{(1+\varepsilon)^s}\,\mathbb{E}L(h_w). \]

On the other hand, by definition of $H_w$, we have

\[ \{H_w \le (1+\varepsilon)h_w\} = \Big\{L((1+\varepsilon)h_w) \ge \frac{\psi((1+\varepsilon)h_w)}{w((1+\varepsilon)h_w)^2}\Big\}, \]

and $L((1+\varepsilon)h_w) \ge L(h_w)$, so we proved that the embedding

\[ \Big\{\frac{L(h_w)}{\mathbb{E}L(h_w)} \ge \frac{1}{(1+\varepsilon)^s}\Big\} \subset \{H_w \le (1+\varepsilon)h_w\} \]

holds when $n$ is large enough. The same argument proves the second embedding of Lemma 7,
which concludes the proof of the Lemma. □
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6.2 Proof of Lemma 2 

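Take $m \in [0, \mu)$ and $\rho \in \mathbb{R}$. Note that $e^y \le 1 + ye^y \le 1 + y + y^2e^y$ for any $y \ge 0$, so

\[ e^{m\zeta^2+\rho\zeta} \le e^{\rho\zeta} + m\zeta^2 e^{m\zeta^2+\rho\zeta} \le 1 + \rho\zeta + (\rho^2 + m)\zeta^2 e^{m\zeta^2+\rho\zeta}, \]

and

\[ \mathbb{E}\big[e^{m\zeta^2+\rho\zeta}\big] \le 1 + (\rho^2 + m)\,\mathbb{E}\big[\zeta^2 e^{m\zeta^2+\rho\zeta}\big], \tag{34} \]

since $\mathbb{E}\zeta = 0$. Take $m_1 \in (m, \mu)$. Since $\rho\zeta \le \varepsilon\rho^2/2 + \zeta^2/(2\varepsilon)$ for any $\varepsilon > 0$, we obtain
for $\varepsilon = [2(m_1 - m)]^{-1}$:

\[ e^{m\zeta^2+\rho\zeta} \le \exp\Big(\frac{\rho^2}{4(m_1 - m)}\Big)e^{m_1\zeta^2}. \]

Together with

\[ \zeta^2 \le \frac{e^{(\mu - m_1)\zeta^2}}{\mu - m_1} \]

and the definition of $\mu$, this entails

\[ \mathbb{E}\big[\zeta^2 e^{m\zeta^2+\rho\zeta}\big] \le \frac{\gamma}{\mu - m_1}\exp\Big(\frac{\rho^2}{4(m_1 - m)}\Big). \]

Thus,

\[
\begin{aligned}
\mathbb{E}\big[e^{m\zeta^2+\rho\zeta}\big] &\le 1 + \frac{\gamma(\rho^2 + m)}{\mu - m_1}\exp\Big(\frac{\rho^2}{4(m_1 - m)}\Big) \\
&\le 1 + \frac{\gamma(\rho^2 + m)}{\mu - m_1}\exp\Big(\frac{\rho^2 + m}{4(m_1 - m)}\Big).
\end{aligned}
\]

For the choice $m_1 = \mu/(1+2\gamma) + 2\gamma m/(1+2\gamma)$ one has $\gamma/(\mu - m_1) = 1/[2(m_1 - m)]$,
so the Lemma follows using that $1 + ye^{y/2} \le e^y$ for all $y \ge 0$. This concludes the
proof of the Lemma. □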
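
For the reader's convenience, the last step can be spelled out as follows. This is a direct computation from the stated choice of $m_1$; the shorthand $y$ is introduced here only for the display.

```latex
\mu - m_1 = \mu - \frac{\mu + 2\gamma m}{1 + 2\gamma} = \frac{2\gamma(\mu - m)}{1 + 2\gamma},
\qquad
m_1 - m = \frac{\mu + 2\gamma m}{1 + 2\gamma} - m = \frac{\mu - m}{1 + 2\gamma},
\]
\[
\text{so that}\quad
\frac{\gamma}{\mu - m_1} = \frac{1 + 2\gamma}{2(\mu - m)} = \frac{1}{2(m_1 - m)} .
\]
\[
\text{With } y := \frac{(1 + 2\gamma)(\rho^2 + m)}{2(\mu - m)}:\quad
\frac{\gamma(\rho^2 + m)}{\mu - m_1} = y,
\quad
\frac{\rho^2 + m}{4(m_1 - m)} = \frac{y}{2},
\quad
1 + y e^{y/2} \le e^{y} = \exp\Big(\frac{(1 + 2\gamma)(\rho^2 + m)}{2(\mu - m)}\Big).
```

The final expression is exactly the right-hand side of Lemma 2.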

6.3 Proof of Lemma 3 

Let $\eta \in [0, 1]$ and $z \in \mathbb{R}_+$ be such that $e^{A\eta}\cosh((1-\eta)z) - \cosh(z) \ge 0$. Let us show
that one has

\[ z \le 2\log 2 + 2A. \tag{35} \]

Since $\cosh(z)/\cosh((1-\eta)z) \ge e^{\eta z}/2$ one has $z \le \eta^{-1}\log 2 + A$. Thus (35) holds
if $\eta \ge 1/2$. If $\eta < 1/2$ and $z \ge \log(3)$, it is easy to check that the derivative of
$x \mapsto \cosh((1-x)z)e^{zx/2}$ is non-positive, hence $\cosh(z) \ge e^{\eta z/2}\cosh((1-\eta)z)$ in this
case. Thus, we have either $z \le \log(3)$ or $z \le 2A$, which yields (35) in every case.
Finally, from (35), we easily derive

\[
\begin{aligned}
e^{A\eta}\cosh((1-\eta)z) - \cosh(z) &= \cosh((1-\eta)z)\Big(e^{A\eta} - \frac{\cosh(z)}{\cosh((1-\eta)z)}\Big) \\
&\le \cosh(z)\,(e^{A\eta} - 1) \\
&\le \cosh(2\log(2) + 2A)\,A\eta\,e^{A\eta}.
\end{aligned}
\]

This concludes the proof of the Lemma. □
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